Abstract

The maximum computational density allowed 
by the laws of physics is available only in a 
format that mimics the basic spatial locality 
of physical law. Fine-grained uniform compu- 
tations with this kind of local interconnectiv- 
ity (Cellular Automata) are particularly good 
candidates for efficient and massive micro- 
physical implementation. 

Conventional computers are ill suited to run 
CA models, and so discourage their develop- 
ment. Nevertheless, we have recently seen 
examples of interesting physical systems for 
which the best computational models are cel- 
lular automata running on ordinary comput- 
ers. By simply rearranging the same quantity 
and quality of hardware as one might find in 
a low-end workstation today, we have made a 
low-cost CA multiprocessor that is about as 
good at large CA calculations as any existing 
supercomputer. This machine's architecture is 
scalable in size (and performance) by orders of 
magnitude, since its 3D spatial mesh organi- 
zation is indefinitely extendable. 

Using a relatively small degree of paral- 
lelism, such machines make possible a level 
of performance at CA calculations much superior
to that of existing supercomputers, but
vastly inferior to what a fully parallel CA ma- 
chine could achieve. By creating an interme- 
diate hardware platform that makes a broad 
range of new CA algorithms practical for real 
applications, we hope to whet the appetite 
of researchers for the astronomical computing 
power that can be harnessed in microphysics 
in a CA format. 

1 Introduction 

Within the Information Mechanics Group at 
the MIT Laboratory for Computer Science, 
a primary focus of our research has been 
on the question: "How can computations 
and computers best be adapted to the con- 
straints and opportunities afforded by micro- 
scopic physics?" This has led us to study spa- 
tially organized computations, since the max- 
imum computational density allowed by the 
laws of physics is available only in a format 
that mimics the basic spatial locality of phys- 
ical law. Fine-grained uniform computations 
with this kind of local interconnectivity (Cellular
Automata) are particularly good candidates
for efficient and massive micro-physical
implementation. 
We have been involved for over a decade
in the design and use of machines optimized
for studying Cellular Automata (CA) computations
|4^, . This involvement
began in response to our need for more power- 
ful CA simulation tools — suitable for investi- 
gating the large-scale behavior of CA systems. 
Using our early CA machines (CAMs) we developed
a number of new CA models and modeling
techniques for physics and for spatially-structured
computation |Q. Eventually we
"published" a commercial version of our CA 
machines, along with a collection of models as 
software examples p^ . 

Some of our earliest models were reversible
lattice gases that simulated a billiard-ball computer.
It was a natural step to use these
and related lattice gases to try to simulate
fluid flow. Although only the linear hydrodynamics
worked correctly (see Figure
our CAM simulations made Pomeau and others
realize that lattice gases were not just conceptual
models, but might be turned into powerful
computational tools (cf. the seminal "FHP" 
lattice-gas paper p2) , and our companion pa- 
per 

The design of our latest CA machine, CAM- 
8, builds upon our accumulated experience 
with previous cellular automata machine de- 
signs, and represents both a conceptual and 
practical breakthrough in our understanding 
of how to efficiently simulate CA systems 
[p?! , [2^ . This new machine is an indefinitely 
scalable three-dimensional mesh-network mul- 
tiprocessor optimized for large inexpensive 
simulations, rather than for ultimate perfor- 
mance. Our small-scale prototype — with an 
amount and kind of hardware comparable to 
that in a low-end workstation — already per- 
forms a wide range of CA simulations at 
speeds comparable to the best numbers re- 
ported for any supercomputer |l^, ^Sj. 
Machines orders of magnitude bigger and pro- 
portionately faster can be built immediately. 



Most of the current exploration of cellular 
automata as computational models for science 
is being done using machines which were de- 
signed for very different purposes. Such exper- 
imentation doesn't make apparent the tremen- 
dous computational power that is potentially 
available to models tailored for uniform arrays 
of simple processors. Nevertheless, we already 
have seen examples of interesting physical sys- 
tems for which the best computational mod- 
els are cellular automata running on ordinary 
computers (cf. 0, H, !§). CAM-8, using a
relatively small degree of parallelism, makes
possible a level of performance at CA calcula- 
tions much superior to that of existing super- 
computers, but vastly inferior to what a fully 
parallel CA machine could achieve. By cre- 
ating an intermediate hardware platform that 
makes a broad range of new CA algorithms 
practical for real applications, we hope to whet
the appetite of researchers for the astronomical
computing power that can be harnessed in
microphysics in a CA format. 



2 An architecture based on 
cellular automata 

In nature, we have a uniform and local law 
in the world that is operating everywhere in 
parallel. A CA model is a synchronous digital 
analog of such a law. As a basis for a com- 
puter architecture, CA's have the advantage 
that there can be a direct mapping between 
the computation and its physical implementa- 
tion: a small region of the computer can im- 
plement a small region of the CA space, and 
adjacent regions of physical space can imple- 
ment adjacent regions of the CA space. Thus 
locality is preserved, and very efficient realiza- 
tions are in principle possible. This efficiency, 
however, comes at the cost of requiring that all 
models run on the machine must be spatially
organized. Thus the unavoidable problem of
ultimately making your computation fit into
a uniform and local physical world is shifted
into the software domain: you must directly
embed your software problems into a uniform
and local spatial matrix.

Figure 1: CAM-8 system diagram. (a) A single processing node, with DRAM site data flowing
through an SRAM lookup table and back into DRAM. (b) Spatial array of CAM-8 nodes, with
nearest-neighbor (mesh) interconnect (one wire per bit-slice in each direction).

CAM-8 is a parallel computer built on this
spatial paradigm. For technological convenience,
it time-shares individual processors
over "chunks" of space — and also time-shares 
the wires connecting each processor with its 
neighboring processors. The time-sharing of 
communication resources reduces the number 
of interprocessor wires dramatically and thus 
allows the scalability that is inherent in the 
CA paradigm to be practically achieved using 
current technology, even in three dimensions. 
The time-sharing of processors allows a highly 
efficient "assembly-line" processing of spatial 
data, in which exactly the same operations are 
repeated for every spatial site in a predeter- 
mined order. 

From the viewpoint of the programmer, this 
virtualization of the spatial sites is not apparent:
you simply program the local dynamics
in a uniform CA space.

2.1 System overview 

Figure 1 is a schematic diagram of a CAM-8
system. On the left is a single hardware
module — the elementary "chunk" of the archi- 
tecture. On the right is an indefinitely ex- 
tendable array of modules (drawn for conve- 
nience as two-dimensional, the array is nor- 
mally three-dimensional). A uniform spatial 
calculation is divided up evenly among these 
modules, with each module simulating a vol- 
ume of up to millions of fine-grained spatial 
sites in a sequential fashion. 

In the diagram, the solid lines between mod- 
ules indicate a local mesh interconnection. 
These wires are used for spatial data move- 
ment. There is also a tree network (not shown) 
connecting all modules to the front-end work- 
station that controls the CA machine. The 
workstation uses this tree to broadcast sim- 
ulation parameters to some or all modules, 
and to read back data from selected modules. 
Normally, the parameters of the next updating
scan of the space are broadcast while the
current scan is in progress, and analysis data 
from the modules are also read back while the 
current scan runs. 

Each module contains a separate copy of 
the current program for updating the space 
(data transformation parameters, data move- 
ment parameters, etc.), and all modules oper- 
ate in lockstep. This allows both the compu- 
tation within modules and communication be- 
tween modules to be pipelined, so that one vir- 
tual processor within each module completes 
its update (including all communication) at 
each machine clock. 

Spatial site data is kept in conventional 
DRAM chips which are all accessed continu- 
ously in a predictable and optimized scan or- 
der, achieving 100% utilization of the available 
memory bandwidth. Within a module, each 
DRAM chip belongs to a separate bit-slice, and 
each DRAM chip has its address controlled sep- 
arately from the rest. The group of bits that 
are scanned simultaneously (one bit from each 
bit-slice) constitutes a hardware cell. Data is
reshuffled between hardware cells by control- 
ling the relative scan order of the DRAM bit- 
slices. 

Updating is by table lookup. Data comes
out of the cell-array, is passed through a
lookup table, and put back exactly where it
came from (Figure 1a). The lookup tables are
double buffered, so that the front-end workstation
can send a new table while the CAM-8
modules are busy using the previous table to
update the space. There are also hardware 
provisions for replacing the lookup tables with 
pipelined logic (to allow versions of CAM-8 
with a large number of bits in the hardware 
cell — too many to update by table lookup), 
and for connecting external data sources or 
analysis hardware. 
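
The table-lookup update with double buffering can be sketched in a few lines of Python. This is only an illustration: the class `LookupUpdater`, the helper `build_table`, and the 4-bit toy cell width are assumptions made here for brevity, not part of the CAM-8 register model (the prototype's tables take 16-bit cells).

```python
from typing import Callable, List

CELL_BITS = 4  # toy width for illustration; the prototype uses 16-bit cells
MASK = (1 << CELL_BITS) - 1


def build_table(rule: Callable[[int], int]) -> List[int]:
    """Tabulate an arbitrary cell -> cell rule over all 2**CELL_BITS inputs."""
    return [rule(value) & MASK for value in range(1 << CELL_BITS)]


class LookupUpdater:
    """Two table buffers: one used by the scan, one free to be loaded."""

    def __init__(self, table: List[int]) -> None:
        self.active = list(table)
        self.pending = list(table)

    def load(self, table: List[int]) -> None:
        # The front end may fill this buffer while a scan is in progress.
        self.pending = list(table)

    def swap(self) -> None:
        # Performed between scans, so each scan sees exactly one table.
        self.active, self.pending = self.pending, self.active

    def scan(self, cells: List[int]) -> None:
        # Each cell is read, looked up, and written back where it came from.
        for i, value in enumerate(cells):
            cells[i] = self.active[value]


def rotate_left(value: int) -> int:
    return ((value << 1) | (value >> (CELL_BITS - 1))) & MASK


def invert(value: int) -> int:
    return ~value & MASK


cells = [0b0001, 0b1000, 0b0110]
updater = LookupUpdater(build_table(rotate_left))
updater.load(build_table(invert))  # queued for the following scan
updater.scan(cells)                # this scan still uses rotate_left
after_first = list(cells)          # [0b0010, 0b0001, 0b1100]
updater.swap()
updater.scan(cells)                # this scan uses invert
```

Because a table holds one entry per possible cell value, its size doubles with every added input bit, which is why wider cells call for pipelined logic instead.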

There are only a handful of connections be- 
tween modules — one per bit-slice to each of 
the six adjacent modules. Uniform data shifts
across the entire three-dimensional space are
achieved by combining DRAM address manipulation
with static routing |2^, |l^: data are
sent over the intermodule wires at preordained 
times, exactly when they are needed by adja- 
cent modules. 

2.1.1 A sample implementation 

For comparison purposes, here is a description 
of the amount and kind of technology used in 
one of our prototype 8-module CAM-8 units: 

• System clock: 25 MHz 

• DRAM: 64 Megabytes (4 Megabit chips, 70 ns)

• SRAM: 2 Megabytes (256 Kilobit chips, 20 ns)

• Logic: about 2 Million gates total 

• Logic technology: 1.2 micron CMOS 

This level of technology is comparable to what 
is used in a low-end workstation — a small 
CAM-8 unit is really a CA personal computer.^
For CA rules with one bit per site, this 8 mod- 
ule machine runs simulations at a rate of about 
3 billion site updates per second on spaces of 
up to half a billion sites; with 16 bits per site, 
simulations run at about 200 million site up- 
dates per second on spaces of up to 32 mil- 
lion sites. Several of our 8-module prototypes 
can be connected together to construct bigger 
machines — repackaging the modules would be 
desirable for constructing substantially larger 
machines. 
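
The quoted rates and space sizes can be checked with simple arithmetic. The sketch below rests on assumptions that are our reading rather than statements from the hardware description: each module updates one 16-bit hardware cell per clock, a rule with fewer bits per site packs several sites into one cell, and "Megabyte" means 2^20 bytes.

```python
MODULES = 8
CLOCK_HZ = 25_000_000
CELL_BITS = 16                 # bits updated per table lookup
DRAM_BYTES = 64 * 2**20        # total over all 8 modules (binary megabytes assumed)

cells_per_second = MODULES * CLOCK_HZ


def site_rate(bits_per_site: int) -> int:
    """Site updates per second when sites are packed into 16-bit cells."""
    return cells_per_second * (CELL_BITS // bits_per_site)


def max_sites(bits_per_site: int) -> int:
    """Largest space that fits in the available DRAM."""
    return DRAM_BYTES * 8 // bits_per_site


print(site_rate(16), max_sites(16))  # 200000000 33554432
print(site_rate(1), max_sites(1))    # 3200000000 536870912
```

Under these assumptions the numbers match the text: about 200 million updates per second on up to 32 million 16-bit sites, and about 3 billion updates per second on up to half a billion 1-bit sites.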

^In comparing the performance of this unit against 
numbers reported for simulations on supercomputers 
(which have a similar performance) one should also 
take availability into account: a personal computer 
can be run on a single problem for a very long pe- 
riod of time. Economies of scale (mass production) are 
also potentially available to personal-computer level 
hardware. 

Our CAM-8 prototype can directly accumulate
and format data for a real-time video display;
provision is also made to accept data directly
from a video camera, in order to allow
CAM-8 to perform real-time video processing 
with CA rules. For a detailed description of 
the prototype CAM-8 implementation, includ- 
ing the CAM-8 register model, the workstation 
interface, and system configuration and initial- 
ization, see "STEP: a Space Time Event Pro- 
cessor [|9|." 

2.2 Programmer's model 

In addition to more specialized resources hav- 
ing to do with display, analysis, and I/O, the 
main programmable resources in CAM-8 are: 

• Number of dimensions. 

• Size and shape of the space. 

• Number of bits at a site. 

• Initial state of the space. 

• Directions and distances of data 
movement. 

• Rules for data interaction. 

All of these parameters are normally specified 
as part of a CAM-8 experiment. Often the data 
movement and data interaction will change 
with time, either cyclically or progressively as 
the simulation runs: the overhead associated 
with changing these parameters before every 
update of the space is negligible. 

2.2.1 The space 

Our earlier CAM machines were all 2- 
dimensional, with severe restrictions on the 
overall size of the space and the number of 
bits at each spatial site. In CAM-8, these pa- 
rameters may be freely specified. 



The overall space-array is configured as 
a multi-dimensional Cartesian lattice with a 
chosen size, shape, number of bits per site, and 
number of dimensions. The boundaries are 
periodic — if you move from site to site along 
any dimension, you eventually get back to your 
starting point. Three of the dimensions can 
be arbitrarily extended by adding "chunks" of 
hardware (modules). The maximum number 
of bits in the array is of course governed by the 
total amount of storage in all of the modules 
(64 Megabytes in our prototype): each mod- 
ule processes an equal fraction of the overall 
space-array. There is no architectural limit on 
how many modules a CAM-8 machine can have. 

2.2.2 Data movement 

In earlier CAM machines, there were severe 
constraints on neighborhoods: restrictions on 
which data from sites near a given site could 
be seen by the CA update rule acting at that 
site. In CAM-8, we have eliminated these con- 
straints. This was accomplished by abandon- 
ing the use of traditional CA neighborhoods, 
and basing our machine on the kind of data 
partitioning characteristic of lattice gas mod- 
els. Instead of having a fixed set of neighbor- 
hood data visible from each site, we shift data 
around in our space in order to communicate 
information from one place to another. 

In traditional CA rules, each bit at a given 
site is visible to all neighbors. In contrast, 
the pure data-movement used in CAM-8 sends 
each bit in only one direction. Information 
fields move uniformly in various directions, 
each carrying corresponding bits from every 
spatial site along with them — in two dimen- 
sions think of uniformly shifting bit-planes, in 
higher dimensions bit-hyperplanes.^ Interactions
act separately on the data that land at
each lattice point, transforming this block of
data into a new set of bits. If some piece of
data needs to be sent in two directions, the
interaction must make two copies.

^The term information field is a bit of a pun, since
we intend by this both the computer science meaning,
namely a fixed set of bits in every "record" (spatial

There is a constraint on how far bit-fields 
can move in one updating step, but it is quite 
mild. Each bit-field can independently shift 
by a large amount in any direction — the max- 
imum shift-component along each dimension 
is one that would transfer the entire sector of 
a bit-field contained in one hardware module 
into an adjacent module. For a two dimensional
simulation on our prototype, for example,
the x and y offsets for each bit-field that
can be incorporated as part of a single updating
step can be any pair of signed integers with
magnitudes of up to a few thousand. In general 
(for any number of dimensions), each updating 
event brings together a selection of bits chosen 
from the few million neighboring sites. 
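
A minimal Python sketch of this kind of data movement, for a two-dimensional space: each bit-field is a plane of bits that shifts as a whole by its own signed offset, with periodic wraparound. The function names and the list-of-lists representation are hypothetical and chosen only for illustration; the hardware achieves the same effect through DRAM address manipulation rather than by copying data.

```python
from typing import List, Tuple

Plane = List[List[int]]  # plane[y][x] holds one bit per site


def shift_plane(plane: Plane, dx: int, dy: int) -> Plane:
    """Move every bit of the plane by (dx, dy), wrapping at the edges."""
    height, width = len(plane), len(plane[0])
    return [
        [plane[(y - dy) % height][(x - dx) % width] for x in range(width)]
        for y in range(height)
    ]


def move_fields(planes: List[Plane], offsets: List[Tuple[int, int]]) -> List[Plane]:
    """Shift each bit-field independently by its own signed offset."""
    return [shift_plane(p, dx, dy) for p, (dx, dy) in zip(planes, offsets)]


def cells_at_sites(planes: List[Plane]) -> List[List[int]]:
    """Collect bit k of each site from plane k, giving one integer per site."""
    height, width = len(planes[0]), len(planes[0][0])
    return [
        [sum(planes[k][y][x] << k for k in range(len(planes))) for x in range(width)]
        for y in range(height)
    ]


# Two 4x4 bit-fields, each holding a single marked bit.
field_a = [[0] * 4 for _ in range(4)]
field_b = [[0] * 4 for _ in range(4)]
field_a[1][0] = 1  # at x=0, y=1
field_b[2][3] = 1  # at x=3, y=2
moved = move_fields([field_a, field_b], [(3, 0), (0, -1)])
cells = cells_at_sites(moved)  # both bits now meet at x=3, y=1
```

After the move, the two bits that started at different sites sit in the same cell, where a subsequent interaction step can act on them together.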



2.2.3 Data interaction 

Data movement and data interaction alter- 
nate: once we have all the data in the right 
place, we update each site using only the infor- 
mation present at that site.^ In our prototype, 
there is a constraint that only 16 bit-fields can 
be moved in independent directions simultane- 
ously, and only 16 bits at a time can interact 
and be updated arbitrarily (by table lookup).^ 

site), and also the physics meaning of a field, which is 
a number attached to each site in space. 

^Actually, the hardware does both movement and
updating in a single pipelined operation. 

^Alternative implementations (using the same CAM
data-movement chips) would allow many more simul- 
taneously moving bit-fields, but would use pipelined 
logic in place of lookup tables, since tables grow in 
size exponentially with the number of inputs. Suffi- 
ciently wide programmable logic can perform any de- 
sired many-input function if there are enough levels of 
logic; an arbitrary number of levels can be simulated 
by changing the program for the logic from one scan 
of the space to the next. An interesting application of 
this would be for efficiently running lattice gases with 



Thus a program for this machine consists of 
a sequence of specifications of (wide ranging) 
particle-like data movements and (arbitrary) 
16-bit interaction events. Simulations involv- 
ing the interaction of large numbers of bits at 
each site have to be broken down into a se- 
quence of 16-bit events — a space-time event 
program. 
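
As an illustration of such a program, here is a sketch of the simple four-direction HPP lattice gas written as a single repeated event: move four bit-fields, then update every site by table lookup. The HPP rule is swapped in here because it fits in 4 bits; the `Event` type, the `run_event` function, and the choice of +y as "north" are assumptions of this sketch, not CAM-8's programming interface.

```python
from typing import List, Tuple

Plane = List[List[int]]
Event = Tuple[List[Tuple[int, int]], List[int]]  # (per-field offsets, lookup table)


def shift(plane: Plane, dx: int, dy: int) -> Plane:
    h, w = len(plane), len(plane[0])
    return [[plane[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]


def run_event(planes: List[Plane], event: Event) -> List[Plane]:
    """One event: move every bit-field, then update each site by table lookup."""
    offsets, table = event
    moved = [shift(p, dx, dy) for p, (dx, dy) in zip(planes, offsets)]
    h, w, n = len(moved[0]), len(moved[0][0]), len(moved)
    out = [[[0] * w for _ in range(h)] for _ in range(n)]
    for y in range(h):
        for x in range(w):
            cell = sum(moved[k][y][x] << k for k in range(n))
            new = table[cell]
            for k in range(n):
                out[k][y][x] = (new >> k) & 1
    return out


# HPP gas: bits 0..3 are particles heading east, north, west, south.
HPP_OFFSETS = [(1, 0), (0, 1), (-1, 0), (0, -1)]
HPP_TABLE = list(range(16))   # every other state passes through unchanged
HPP_TABLE[0b0101] = 0b1010    # a head-on east/west pair leaves north/south
HPP_TABLE[0b1010] = 0b0101    # and vice versa
PROGRAM: List[Event] = [(HPP_OFFSETS, HPP_TABLE)]


def particle_count(planes: List[Plane]) -> int:
    return sum(bit for p in planes for row in p for bit in row)


def empty_planes(size: int) -> List[Plane]:
    return [[[0] * size for _ in range(size)] for _ in range(4)]


planes = empty_planes(8)
planes[0][4][2] = 1  # east-mover at x=2, y=4
planes[2][4][4] = 1  # west-mover at x=4, y=4
start = particle_count(planes)
for step in range(10):
    for event in PROGRAM:
        planes = run_event(planes, event)
```

A rule needing more bits than one lookup can take would simply put several events in `PROGRAM`, each with its own offsets and table.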



3 Applications 

CAM-8 is good at spatially moving data, and
at making the data interact at lattice sites. 
This makes it well suited for simulating phys- 
ical systems using lattice-gas-like dynamics. 
This also makes it appropriate for a wide range 
of other spatially organized calculations in- 
volving localized interactions. 

We are actively collaborating with several 
groups to develop sample applications which 
illustrate the use of this CA machine for physical
simulations (e.g., fluid flow, chemical reactions,
polymer dynamics), two and three dimensional
image processing (e.g., document
reading, medical imaging from 3D data), and 
large logic simulations (including the simu- 
lation of highly parallel CA machines). Of 
course all of the models developed for our ear- 
lier CAM machines run well on CAM-8, and 
can now be extended far beyond the capabil- 
ities of these earlier machines. Many spatial 
algorithms (systolic, SIMD, etc.) designed for 
other machines [|l^, ^ can also be adapted to 
this architecture. 

As illustrations of the use of CAM-8, some 
sample applications and simulation techniques 
are discussed below. All of these examples 
have been developed on the prototype machine 
discussed in Section 2.1.1, and performance
figures are for this workstation-scale device.



large numbers of bits per site (cf . [ po| ) . 

Figure 2: Flows for two simulations using the FHP lattice gas. (a) Von Karman streets,
(b) Kelvin-Helmholtz shear instability. 



3.1 Lattice gases 

CAM-8 is at heart a lattice gas machine. Particle
streaming is an efficient, low-level hardware
operation, and the large spatial data shifts 
available make it convenient to investigate 
models with widely varying particle speeds. 
Multi-dimensional shifts are useful for investi- 
gating models with shallow extra dimensions. 

Our most advanced lattice gas collaboration 
is with Jeff Yepez and his group at the U.S. Air 
Force's Phillips Laboratory. He and Phillips 
Labs have started a new initiative on geophys- 
ical simulation that involves the construction 
of a large CAM-8 machine.

Geophysical phenomena are good candi- 
dates for lattice dynamics modeling since there 
is so much distributed complexity involved, 
and since many of these phenomena are so 
hard to model using traditional differential 
equation techniques. With lattice gases, the 
simulation runs just as fast with the most com- 
plex boundary condition as with the simplest. 
One can use a great deal of physical intuition 
in incorporating desired properties into models 
by constructing simplified discrete versions of 
the actual physical dynamics. The process of 
making these models is closely akin to that of 
making models in statistical mechanics, where
one strives to include only the essence of the
phenomenon |]5^ .

Figures 2 and 3 illustrate some simple
"warmup" calculations done in collaboration 
with Yepez that illustrate the use of CAM-8's 
statistics gathering hardware. Here, we split 
the system up into bins of a chosen size and use 
lookup tables to count a function of the state 
of the sites in each bin. These event counts are 
continuously reported back to the workstation 
that is controlling the simulation. 
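
The binning idea can be sketched as follows. A lookup table maps each site state to a count, and the counts are summed over each bin. The 4-bit state layout (east, north, west, south), the table names, and the subtraction of two event counts to form a signed momentum are all assumptions of this sketch, made to keep the example small.

```python
from typing import List


def binned_counts(cells: List[List[int]], table: List[int], bin_size: int) -> List[List[int]]:
    """Sum table[cell] over each bin_size x bin_size block of sites."""
    h, w = len(cells), len(cells[0])
    counts = [[0] * (w // bin_size) for _ in range(h // bin_size)]
    for y in range(h):
        for x in range(w):
            counts[y // bin_size][x // bin_size] += table[cells[y][x]]
    return counts


# Example tables for a 4-bit gas (bit 0 east, 1 north, 2 west, 3 south).
DENSITY = [bin(c).count("1") for c in range(16)]
EAST_MOVERS = [c & 1 for c in range(16)]
WEST_MOVERS = [(c >> 2) & 1 for c in range(16)]

cells = [
    [0b0001, 0b0000, 0b0100, 0b0100],
    [0b0011, 0b0000, 0b0000, 0b0000],
    [0b0000, 0b0000, 0b1111, 0b0000],
    [0b0000, 0b0000, 0b0000, 0b0001],
]
density = binned_counts(cells, DENSITY, 2)
east = binned_counts(cells, EAST_MOVERS, 2)
west = binned_counts(cells, WEST_MOVERS, 2)
# Net x-momentum per bin, formed from two unsigned event counts.
momentum_x = [[e - w for e, w in zip(er, wr)] for er, wr in zip(east, west)]
```

Averaging such per-bin counts over many time steps gives the kind of smoothed flow field that the arrows in the figures represent.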

Figure 2a shows momentum flow in a two-dimensional
2K x 1K lattice, illustrating vortex
shedding in lattice-gas flow past a flat plate. 
Here we use a 7-bit "FHP" model, which 
runs on our prototype at a rate of 382 mil- 
lion site updates per second (for pure simula- 
tion). Both the time averages (over 100 steps) 
and the space averages (over 32 x 32 sites) were 
accumulated by CAM; the workstation simply 
drew the arrows. 

Similarly, Figure 2b uses the same model to
illustrate a Kelvin-Helmholtz shear instability 
on a 4Kx2K lattice. Most of the fluid was 
initially set in motion at Mach 0.4 to the right, 
except for a narrow strip in the middle which 
was started with the opposite velocity. The 
Figure shows the situation after 40,000 time 
steps (about 15 minutes of simulation). The 
averaging is over regions of 128 x 128 sites, and
over 50 time steps.

Figure 3: Rayleigh-Benard convection.

Finally, Figure 3 illustrates Rayleigh-Benard
convection, following [Q. The simulation
uses a 13-bit hexagonal lattice-gas, with 3
particle speeds, heating (at the bottom), cool- 
ing (at the top), walls around the box, and 
gravity. The simulation size is 1024x512, and 
the prototype runs this at a rate of 191 million 
site updates per second. The time and space 
averaging was done by CAM as in the previous 
figures. 

We have also been working with Bruce 
Boghosian and Dan Rothman on three dimen- 
sional lattice gas models. Since the CM-2 also 
has 16-bit lookup tables, the "random isom- 
etry" techniques that were used to partition 
lattice-gas updates into a composition of 16- 
bit lookups on the Connection Machine carry 
over directly to CAM-8 |§, a 24-particle 
FCHC lattice gas with solid boundaries runs 
at about 7 million site-updates per second. We 
are using these techniques as the basis for im- 
plementing simulations of the flow of immisci- 
ble fluids through porous media p6[ . 

3.2 Statistical mechanics 

Physicists have long used discrete models in 
statistical mechanics to model material sys- 



tems. In simulating such systems it is often 
important to have available large quantities of 
precisely controllable random variables. On 
CAM-8, by independently applying large ran- 
dom spatial shifts to each of a few randomly 
filled bit fields (and by employing other related 
techniques), it is possible to avoid local corre- 
lations and continuously generate high qual- 
ity random variables without slowing the sim- 
ulation down. Using such random variables, 
we have run three dimensional thermalized an- 
nealing models |^ on our 8-module prototype 
at about 200 million site-updates per second 
on a space of 16 million 16-bit sites (about 
12 updates of the 3-dimensional space per sec- 
ond), with simultaneous rendering (by discrete 
ray tracing as part of the CA dynamics) and 
display. Figure 4a shows one rendered image
from the CAM-8 display for a 512 x 512 x 64 
simulation. 
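
A rough Python sketch of the shifting idea: a few randomly filled bit-planes are each displaced by an independent random offset on every step, and planes are combined to obtain bits with other probabilities. Combining two fair planes with AND to get a probability near 1/4 is an illustrative choice made here, not necessarily one of the "related techniques" the machine relies on, and the function names are hypothetical.

```python
import random
from typing import List

Plane = List[List[int]]


def random_plane(size: int, rng: random.Random) -> Plane:
    return [[rng.getrandbits(1) for _ in range(size)] for _ in range(size)]


def shift(plane: Plane, dx: int, dy: int) -> Plane:
    n = len(plane)
    return [[plane[(y - dy) % n][(x - dx) % n] for x in range(n)] for y in range(n)]


def reshuffle(planes: List[Plane], rng: random.Random) -> List[Plane]:
    """Give every plane its own large random offset for this step."""
    n = len(planes[0])
    return [shift(p, rng.randrange(n), rng.randrange(n)) for p in planes]


def biased_bits(a: Plane, b: Plane) -> Plane:
    """AND of two fair planes: each site is 1 with probability about 1/4."""
    return [[x & y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


SIZE = 64
rng = random.Random(1234)
planes = [random_plane(SIZE, rng) for _ in range(2)]
ones_before = [sum(map(sum, p)) for p in planes]
fractions = []
for step in range(5):
    planes = reshuffle(planes, rng)
    noise = biased_bits(planes[0], planes[1])
    fractions.append(sum(map(sum, noise)) / (SIZE * SIZE))
ones_after = [sum(map(sum, p)) for p in planes]
```

The shifts only rearrange bits, so each plane keeps its number of ones; what changes from step to step is which bits from different planes meet at a given site.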

Figure 4b shows a deterministic simulation
of a model due to David Griffeath at the Uni- 
versity of Wisconsin. He and some of his col- 
laborators are engaged in the analysis of com- 
binatorial mathematics problems that have 
spatial locality. They have been using our ear- 
lier, much more limited CAM-6 machine in this 
capacity for a number of years fisll . The simu- 
lation shown is a kind of annealing rule: each 
site in the space (512 x 512) takes on whatever 
value is in the majority in its neighborhood. 
The neighborhoods are quite large — they in- 
volve the 121 neighbors in an 11 x 11 region 
surrounding each site. Since there are 5 dif- 
ferent species (3 bits of state), the updating 
rule must deal with 363 bits of state in each 
neighborhood. This is done as a composition 
of about 70 distinct updating steps, and so we 
get about 10 complete updates of the space 
per second (about 2.5 million site updates per 
second). A better algorithm, that doesn't re- 
calculate the species-counts for overlapping re- 
gions of adjacent neighborhoods, would run an 
order of magnitude faster. In either case, this
example serves to illustrate how rules that involve
the interaction of large numbers of bits
at each site are handled by composing updating
steps.^

Figure 4: Four materials simulations. (a) Spongy three-dimensional structure obtained by
"majority" annealing. (b) Typical texture produced by one of Griffeath's large-neighborhood
voting rules. (c) Diffusion limited aggregation. (d) Polymers diffusing from an initial concentrated
region.
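
For reference, the large-neighborhood majority rule can be written directly in Python. This is the naive form that recounts every neighborhood from scratch, not the composition of roughly 70 lookup steps used on the machine; the tie-breaking choice and the function name are assumptions of this sketch.

```python
from collections import Counter
from typing import List

Grid = List[List[int]]


def majority_step(grid: Grid, radius: int) -> Grid:
    """Each site adopts the most common value in its (2r+1) x (2r+1) block.

    Boundaries are periodic. Ties go to the smallest value, an arbitrary
    choice made for this sketch.
    """
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            votes = Counter(
                grid[(y + dy) % h][(x + dx) % w]
                for dy in range(-radius, radius + 1)
                for dx in range(-radius, radius + 1)
            )
            best = max(votes.values())
            out[y][x] = min(v for v, n in votes.items() if n == best)
    return out


RADIUS = 5                             # gives the 11 x 11 neighborhood
neighbors = (2 * RADIUS + 1) ** 2      # 121 sites
bits_per_neighborhood = neighbors * 3  # 363 bits at 3 bits per site

# A lone dissenter in a uniform field is outvoted in one step.
grid = [[2] * 12 for _ in range(12)]
grid[6][6] = 4
after = majority_step(grid, RADIUS)
```

Adjacent neighborhoods share most of their sites, which is the redundancy a faster algorithm would exploit by updating the counts incrementally.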

Figure 4c shows a two-dimensional diffusion
limited aggregation simulation on a 1024 x 
1024 space, driven by random variables. The 
system shown is started with a single fixed 
particle in the center of about 100,000 ran- 
domly diffusing particles. Whenever a diffuser 
touches a fixed particle, it becomes fixed at 
that position. This is a large version of a 
CAM-6 experiment p6| , but run more than 
two orders of magnitude faster than CAM-6 
could have run it. The simulation performs 
about 800 million site updates per second, and 
the Figure shows the state of the system after 
about two minutes of evolution. 
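
The aggregation rule itself is easy to state in code. The sketch below is a conventional sequential random-walk version in Python, with at most one particle per site; it is not the partitioned CA rule run on CAM-8, and the function name, update order, and parameters are assumptions made for illustration.

```python
import random
from typing import Set, Tuple

Site = Tuple[int, int]
STEPS = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def dla(size: int, walkers: int, steps: int, seed: int = 0) -> Tuple[Set[Site], Set[Site]]:
    """Walkers diffuse on a periodic lattice and freeze next to fixed sites."""
    rng = random.Random(seed)
    center = (size // 2, size // 2)
    candidates = [(x, y) for x in range(size) for y in range(size) if (x, y) != center]
    free: Set[Site] = set(rng.sample(candidates, walkers))
    fixed: Set[Site] = {center}
    for _ in range(steps):
        for site in sorted(free):  # snapshot of positions, in a fixed order
            x, y = site
            dx, dy = rng.choice(STEPS)
            target = ((x + dx) % size, (y + dy) % size)
            if target not in free and target not in fixed:
                free.remove(site)  # step into the empty neighboring site
                free.add(target)
                site = target
            x, y = site
            if any(((x + ax) % size, (y + ay) % size) in fixed for ax, ay in STEPS):
                free.remove(site)  # touching the aggregate: become fixed
                fixed.add(site)
    return fixed, free


fixed, free = dla(size=24, walkers=60, steps=50, seed=1)
```

Every particle is either still diffusing or part of the aggregate, so the total number of particles never changes, and each fixed particle other than the seed has at least one fixed neighbor.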

Figure 4d shows another statistical particle-based
simulation: a CA polymer model due to
Yaneer Bar-Yam (cf. |3^, ; the CAM-8 program
was written by Michael Biafore). This
discrete model captures certain essential fea- 
tures of polymers: conservation of the total 
number of monomers, preservation of connec- 
tivity, monomers can't overlap (excluded vol- 
ume), etc. It employs a statistical dynamics 
(controlled by CAM-8 random variables) that 
uses space-partitioning to maintain these constraints.
The simulation discussed in Q
ran at a rate of about 30 million site updates 
per second on a space 512 x 512. Problems that 
are being addressed with these models include 
dynamics in polymer melts, gelation and phase 
separation, polymer collapse, and pulsed field 
gel electrophoresis | |40| . 

^At the opposite extreme of few bits per site, the
"Bonds Only" [Ml version of Michael Creutz's dynamical
Ising modelH is a 1-bit per site partitioning rule
that runs at a rate of about 3 billion site updates per
second on our prototype.

CAM-8 is designed to numerically analyze
the models run on it, largely through the use
of the event counters Eg]. By appropriately
augmenting the system dynamics with extra
degrees of freedom, we can make essentially 
any desired property of the system quanti- 
tatively visible. For example, localized spa- 
tial averages (such as density, pressure, en- 
ergy density, temperature, magnetization den- 
sity) can be gathered as we did to produce 
the momentum flows in the previous section; 
global correlation statistics can be accumu- 
lated quickly for occurrences of given spatial 
patterns; autocorrelations can be computed 
by comparing the system to a copy of it- 
self shifted in time and/or space |Q; and 
block-spin transformations can be quickly per- 
formed, simplifying renormalization group cal- 
culations of critical exponents. 
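
The shifted-copy comparison can be illustrated for a single bit-plane. In this Python sketch, `agreement` counts the sites where the plane matches a displaced copy of itself, and `correlation` converts that to the usual product average for spins s = +1 or -1. The function names are hypothetical, and only spatial shifts are shown.

```python
from typing import List

Plane = List[List[int]]


def agreement(plane: Plane, dx: int, dy: int) -> float:
    """Fraction of sites whose bit equals the bit (dx, dy) away, periodically."""
    h, w = len(plane), len(plane[0])
    same = sum(
        1
        for y in range(h)
        for x in range(w)
        if plane[y][x] == plane[(y + dy) % h][(x + dx) % w]
    )
    return same / (h * w)


def correlation(plane: Plane, dx: int, dy: int) -> float:
    """Average of s * s' for s = 2*bit - 1, i.e. P(same) - P(different)."""
    return 2.0 * agreement(plane, dx, dy) - 1.0


# Vertical stripes of width 1: perfectly anticorrelated at dx=1.
stripes = [[x % 2 for x in range(8)] for _ in range(8)]
print(correlation(stripes, 1, 0))  # -1.0
print(correlation(stripes, 2, 0))  # 1.0
```

On the machine the comparison would be an interaction between a bit-field and its shifted copy, with the matches tallied by an event counter.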

3.3 Data visualization and image 
processing 

Another area we've been exploring is two- and 
three-dimensional image processing. We were 
led into this area initially by the display needs 
of our physical simulations (e.g., see Figure 
discussed above), but this activity has taken 
on a life of its own. 

Our CAM-8 machine simulates a kind of 
raster-scan universe, in which each hardware 
module sequentially scans its chunk of the 
overall simulation space. This raster scan 
can in fact be programmed to be two di- 
mensional, and synchronized and interfaced 
with an external video source. The neces- 
sary hardware is included as part of our pro- 
totype, and allows us to perform realtime 
image processing. Generic bit-map process- 
ing/smoothing/improving techniques are sup- 
ported through a combination of local (CA) 
operations and global statistics gathering via 
the hardware event counters. Well-known
CA image-processing algorithms, such as those 
used commercially for locating and counting 
objects in images, can also be run efficiently.

Figure 5: Continuous rotations on a CA machine. Top: 2D rotation of realtime video data.
Bottom: 3D rotation of MRI data.

Many novel algorithms are also directly supported
by the architecture. For example, the
8-node prototype can rotate a 512 x 512 bit- 
map image through an arbitrary angle in less 
than 10 milliseconds by permuting the ar- 
rangement of the pixels to move every pixel 
to within one pixel-width of its best possible 
rotated position. Figure 5a shows camera input
of a closeup of the CAM-8 chip (the semi-custom
chip that knits memory chips together
into a CA machine). Figure 5b shows the same
image rotated by CAM-8 through an angle of
35 degrees.^
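
One well-known way to rotate an image purely by shifting data is the three-shear decomposition, in which each shear displaces whole rows or columns by integer amounts and therefore only permutes pixels. The Python sketch below shows that decomposition as an illustration; whether CAM-8's rotation uses exactly this scheme is not stated here, and the function names and the cyclic wraparound are assumptions of the sketch.

```python
import math
from typing import List

Image = List[List[int]]


def shear_rows(img: Image, k: float) -> Image:
    """Cyclically shift row y by round(k * (y - cy)) pixels."""
    h, w = len(img), len(img[0])
    cy = (h - 1) / 2
    out = []
    for y in range(h):
        s = round(k * (y - cy))
        out.append([img[y][(x - s) % w] for x in range(w)])
    return out


def shear_cols(img: Image, k: float) -> Image:
    """Cyclically shift column x by round(k * (x - cx)) pixels."""
    h, w = len(img), len(img[0])
    cx = (w - 1) / 2
    shifts = [round(k * (x - cx)) for x in range(w)]
    return [[img[(y - shifts[x]) % h][x] for x in range(w)] for y in range(h)]


def rotate(img: Image, degrees: float) -> Image:
    """Rotate about the image center as shear-x, shear-y, shear-x."""
    a = math.radians(degrees)
    alpha, beta = -math.tan(a / 2), math.sin(a)
    return shear_rows(shear_cols(shear_rows(img, alpha), beta), alpha)


# Give every pixel a distinct label so the permutation can be checked.
image = [[16 * y + x for x in range(16)] for y in range(16)]
rotated = rotate(image, 35)
```

Since every step is a uniform shift of a row or column, no pixel values are interpolated or lost: the output contains exactly the input pixels in a new arrangement.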

In three dimensions, local CA techniques 
can be used to find and to smooth two- 
dimensional surfaces to be visualized. For ex- 
ample, magnetic resonance imaging can pro- 
duce three-dimensional arrays of spatial den- 
sity data that subsequently need to be visual- 
ized. Interesting features might be the surface 
of the brain, the surfaces of lesions, blood ves- 
sels, etc. Local rules can be used to trace fea- 
tures (e.g., blood vessels are regions connected 
to segments that have already been identified 
as blood vessels) and to smooth surfaces (e.g., 
using annealing rules that have surface ten- 
sion). Once a surface has been distinguished, 
many bit-map oriented rendering techniques 
are available. The simplest is probably the 
same one used in Figure 4a: just simulate
"photons" of light moving from site to site, entering
the system from one direction, and being
observed from another. Figures 5c and 5d
show the surface of the brain generated from 
MRI data, and rendered by such a technique. 
The two images are rotated versions of the 



^This same kind of technique is applicable in other 
contexts. For example, a matrix transpose can be ac- 
complished by a 90 degree rotation and a flip — this 
combined operation on CAM-8 takes the same time as 
the rotation alone. Some of our collaborators (Bryant 
York and Leilei Bao at Northeastern University) are 
performing combinatorial searches on CAM-8 by apply-
ing these kinds of techniques to large multidimensional
matrices. 



same data — we can actually do an arbitrary 
three-dimensional rotation of site data using 
the same technique used in Figures 5a and 5b
in just three updating scans of the space p5| . 

If you render a surface twice, once from 
each of two slightly separated vantage points, 
you can quickly produce stereo pairs. We 
have tested this technique^ in some of our
physical simulations: we have run a version 
of the three-dimensional annealing simulation 
pictured in Figure ^ while continuously gen- 
erating such stereo pairs, without slowing the 
simulation down at all. Using this technique 
to generate images from many vantage points, 
one can quickly generate data needed for pro- 
ducing holograms from computer volumetric 
data.^

3.4 Spacetime circuitry 

CAM-8 can rapidly perform not only arbitrary
rotations, but also affine transformations on
its data — the hardware can skip or repeatedly 
scan sites during updating in order to rescale 
an image. Actually, we can do far better 
than this: CAM-8 can perform arbitrary rear- 
rangements of bits, with any set of local, non- 
uniform operations along the way. To get an 
arbitrary transformation, you simply simulate 
the right logic circuit! 

A digital logic circuit is a physical system 
that (not surprisingly) can be simulated ef-
ficiently by a (digital) CA space. Figure 6a
shows a straightforward simulation of logic us-
ing CAM-8.^ Here we have a CA space that
simulates a kind of sea-of-gates gate-array, 
with one gate at each spatial site. Local rout- 
ing information recorded at each site deter- 



^Mike Biafore led this effort. 

^CAM-8 should also be useful for reconstructing
three-dimensional surfaces from holographic data, 
algorithm implemented by the HORN machine [fe 
should run faster on our CAM-8 prototype than on the 
special-purpose HORN hardware.

^The circuit shown is due to Ruben Agin.
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Figure 6: Logic simulation, (a) Gate-array-like CA simulation of a random number generator, 
(b) Spacetime wires reverse the bits in a 1-dimensional space. 



mines how data hops between bit-fields that 
shift in various directions, in order to im- 
plement the wires that connect the gates to- 
gether. Large three dimensional logic simu- 
lations can be performed by CAM-8 in this 
manner: just as with other spatially orga- 
nized computations, the kind of virtualization 
of spatial sites (gates here) that CAM-8 does 
makes such simulations practical.^ The in-
vestigation of CA rules that permit efficient
logic simulation is also important for highly- 
parallel fixed-rule CA hardware: if the fixed 
rule supports logic simulation, then the ma- 
chine can simulate any other CA rule by tiling 
the simulation space with appropriate blocks 
of logic.^



^Since CAM-8 shares each processor over up to a
few million spatial sites, much higher performance spe- 
cialized machines with a lower virtualization ratio can 
be made to implement specific CAM-8 rules — such as a 
logic simulation rule, or an image processing rule. You 
trade flexibility (large spatial shifts and large lookup 
tables) and simulation size for speed. Notice that even 
if FPGAs are used for implementing these specialized
machines, very high silicon utilization ratios can be 
achieved, since the regular structure of a CA maps 
naturally onto the regular structure of an FPGA. 

^The idea of using CA's to do logic is quite old. In



Now consider the problem of producing 
rather general transformations of the data in 
our CA space. One approach would be to di- 
rectly simulate a gate-array-like rule that op- 
erates on the original data, and eventually pro- 
duces the transformed data. An efficient tech- 
nique for doing this on CAM-8 is called space- 
time circuitry. This involves adding an extra 
dimension to your system to hold the transfor- 
mation circuitry, laid out as a pipeline in which 
each stage is evaluated only once, as the data 
passes through it 0. 

As a simple example, consider a 1- 
dimensional space where the desired transfor- 
mation is to reverse the order of the data bits 
across the width of the space. We add a di-
mension (labeled u in Figure 6b) and draw a
circuit that accomplishes the reversal — in this 
simple case, we only need wires. The circuit 
shown is a data pipeline that copies informa- 
tion up one row at each stage, and possibly 
over by one position right or left: the infor- 
mation about which way the data should go is 



fact, much of the present work on field programmable 
gate arrays carries forward ideas that originated in 
early work on CA's (cf. [||, |l6[ |3|| . ) 
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stored locally. The CAM-8 rule that achieves 
this transformation only involves 5-bit sites — 
two bits of stationary routing information, and 
three shifting bit-fields to transport the sig- 
nals. If we continually add new information 
at the bottom of the picture, reversed data 
continually appears, with a 10-stage propaga- 
tion delay, at the top. But if we only want to 
accomplish the transformation once, then we 
only need to update each consecutive row of 
the circuit once, moving the signals up to the 
next row before we update it in turn. In this 
case, instead of one update of the space mov- 
ing the whole pipeline forward by one stage, 
the row by row update will move one set of 
data all the way through the pipeline!^ We
still get one result per update of the space 
(exactly as before), but the propagation de- 
lay has been reduced to a single scan of the 
space! Thus given a CA space, by adding 
a dimension containing a sufficiently compli- 
cated pipelined circuit, any desired transfor- 
mation of the original space can be achieved 
in one scan of the augmented space — limited 
only by the total amount of space available for 
the extra-dimensional circuitry. 
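
The two ways of evaluating such a pipeline can be sketched in a few lines. This is a toy model under stated assumptions, not the CAM-8 rule: the routing table holds one value in {-1, 0, +1} per site (standing in for the stationary routing bits), the names `brick_routing`, `single_scan`, and `pipelined` are invented for the example, and the wiring is a simple "brick" pattern of crossing neighbours that needs n stages to reverse n bits, which is not necessarily the circuit drawn in the figure.

```python
def brick_routing(n):
    """Routing rows for reversing n bits: r[t][x] in {-1, 0, +1} says
    where the signal at column x hops as it moves up from row t."""
    rows = []
    for t in range(n):
        row = [0] * n
        for x in range(t % 2, n - 1, 2):   # adjacent wires cross
            row[x], row[x + 1] = +1, -1
        rows.append(row)
    return rows

def single_scan(data, routing):
    """Update each row once, in order, carrying one data set to the top."""
    cur = list(data)
    for row in routing:
        nxt = [None] * len(cur)
        for x, hop in enumerate(row):
            nxt[x + hop] = cur[x]          # copy up one row, maybe sideways
        cur = nxt
    return cur

def pipelined(inputs, routing):
    """Update every row at each step; a new data set enters at each step."""
    depth = len(routing)
    rows = [None] * depth                  # data set held below each stage
    results = []
    for new in list(inputs) + [None] * depth:   # extra steps drain the pipe
        moved = [single_scan(d, [r]) if d is not None else None
                 for d, r in zip(rows, routing)]
        if moved[-1] is not None:
            results.append(moved[-1])
        rows = [new] + moved[:-1]
    return results

routing = brick_routing(8)
bits = [1, 0, 1, 1, 0, 0, 1, 0]
reversed_bits = single_scan(bits, routing)
```

In `pipelined`, a data set emerges only after as many whole-space updates as there are stages, whereas `single_scan` carries one data set through every stage in a single ordered pass over the rows, which is the distinction drawn in the text.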

If the problem we're interested in is the sim- 
ulation of a clocked logic circuit, this technique 
can be used to greatly speed up the simulation. 
Instead of updating our CA space over and 
over again while signals propagate around the 
system, passing through gates and eventually 
being latched in preparation for the next clock 
cycle, we can pipeline this calculation using an 
extra dimension, and perform the entire clock 
cycle in a single update of the space. Since 
the total volume of space (number of sites) 
needed to represent all of the gates and wires 
should be comparable to the volume without 



^ The rendering algorithm of Figure uses essen- 
tially this technique to propagate the light all the way 
through the material system in a single scan of the 
space. 



the pipeline dimension,^ this represents an
enormous speedup. If we think of the rout- 
ing and gate information that is spread out in 
the pipeline dimension as being spread out in 
time, then we greatly reduce the space needed 
for the calculation by making what happens 
at each spot time dependent — hence the term 
spacetime circuitry. 

An additional benefit of spacetime circuitry 
on CAM-8 is that it allows us to take good 
advantage of the large spatial shifts that are 
available in this architecture. In the logic ex- 
ample, we could use big spatial shifts at some 
stages of the pipeline, and smaller ones at 
other stages, in order to route all signals in as 
few stages as possible — this provides a further 
speedup of the simulation. Of course these 
sorts of techniques (extra dimensions and big 
shifts) will not be applicable to fully paral- 
lel CA machines built at the most microscopic 
scale, but they add greatly to the power and 
flexibility of our virtual-processor implemen- 
tation. 

4 Software 

During the design of the CAM-8 ASIC, we de- 
cided to implement a version of the software 
that would drive the real hardware, and use 
that to drive complete system simulations of 
the CAM-8 hardware, including the worksta- 
tion interface hardware. Thus when the hard- 
ware arrived, we immediately had software 
that would drive it, and could run the same 
tests that we used to validate the design. 

This initial software was intentionally rather 
low level, since it was necessary to have low 
level access and control to thoroughly and ef- 
ficiently drive gate-level simulations that ran 
eight orders of magnitude slower than a single 

^Since routing signals in a higher dimension is gen-
erally much easier than in a lower dimension, the cir-
cuit should actually be more compact. 
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Figure 7: Sample experiment. 



module of the actual hardware. The present 
(still rather rudimentary) CAM-8 systems soft- 
ware has been built as several layers on top of 
this initial work. It provides a prototypical 
programming environment for CAM-8 which 
demonstrates how to access and control all 
facets of the hardware. 

4.1 A high level machine language

For simple CA models running on regular crys- 
tal lattices, the mapping between the model 
and the CAM-8 architecture is quite direct.^ 
To illustrate this direct mapping for the sim-
plest lattice gas model, Figure 7 shows a CAM-8
assembly language program for running the
HPP lattice gas. This program translates
into about a dozen CAM-8 machine-language 
instructions to be broadcast to CAM. It has 
two main parts: a rule definition, and a def- 
inition of what constitutes an updating step. 
The updating step broadcasts the rule, adjusts 
some CAM data paths, specifies some uniform 
data movements of the four bit-fields used to 
transport particles, and initiates a scan of the 
space. Despite being at such a low level, this 

^Embedding any regular lattice into CAM's Carte-
sian lattice generally involves combining several adja-
cent sites of the original lattice into one CAM site.



program runs without change on a machine 
with any number of modules.^ Issues such as
making the data move smoothly across module 
boundaries are handled directly by the hard- 
ware. 
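
The two parts of such a program, a rule and an updating step, can be mimicked in ordinary software. The sketch below is a plain Python model of the HPP lattice gas, not CAM-8 assembly language: the bit assignments, the names `RULE`, `SHIFT`, `step`, and `particles`, the wraparound boundaries, and the order "lookup, then shift" are assumptions chosen for the example.

```python
E, W, N, S = 1, 2, 4, 8                    # one bit-field per direction
SHIFT = {E: (1, 0), W: (-1, 0), N: (0, -1), S: (0, 1)}

# Rule definition: a 16-entry lookup table.  Head-on pairs scatter at
# right angles; every other site value is left unchanged.
RULE = list(range(16))
RULE[E | W], RULE[N | S] = N | S, E | W

def step(grid):
    """One updating step: table lookup at every site, then a uniform
    shift of each of the four bit-fields (periodic boundaries)."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            site = RULE[grid[y][x]]
            for bit, (dx, dy) in SHIFT.items():
                if site & bit:
                    out[(y + dy) % h][(x + dx) % w] |= bit
    return out

def particles(grid):
    """Total number of particles (set bits) in the space."""
    return sum(bin(site).count("1") for row in grid for site in row)

# Two particles approaching head-on along the middle row of a 3 x 3 space.
demo = [[0, 0, 0], [E, 0, W], [0, 0, 0]]
after_one = step(demo)        # both particles now share the centre site
after_two = step(after_one)   # they scatter and leave vertically
```

Nothing in `step` refers to how the space might be divided among processors, which is the software analogue of the module-independence noted above.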

Figure 7 also shows a "snapshot" from the
CAM-8 display of a sound pulse resulting when 
this exact code is run from an initial pattern of 
random particle data with a cavity (a 64x64 
particle vacuum) in the center. 

4.2 Zero-module scalability 

The CAM-8 machine language is directly inter- 
preted by the hardware interface that resides 
in the workstation that controls CAM. This 
language forms a sharp and simple boundary 
between the software and the hardware — all 
interaction gets tunneled through this inter- 
face. A software simulator of CAM-8 has only 
to correctly interpret this machine language 
in order to be compatible with all higher level 
software written for CAM. 

Since the CAM-8 architecture depends so 
heavily on data movement by pointer manip- 
ulation and updating by table lookup, it is in 

^Utilities that download initial patterns and that
manage the video display are not shown here — the 
lowest-level interaction of these routines with CAM de- 
pends explicitly on the number of modules. 
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fact very well suited to direct software simu- 
lation on serial machines. A functionally ac- 
curate software simulator of CAM-8 has been 
constructed for the Sun SPARCstation which 
runs CA models about as fast as the best ex- 
isting CA simulators for that machine — as fast 
as simulators that are not burdened with the 
constraint of also simulating CAM-8 function- 
ality p. 
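
To make the phrase "data movement by pointer manipulation and updating by table lookup" concrete, here is a minimal serial sketch. It is not the simulator mentioned above; the class `BitField`, the function `update`, and the storage layout are hypothetical. A shift changes only an origin offset, and an update reads the bits currently aligned at each site, looks them up in a table, and writes the result back in place.

```python
class BitField:
    """A 2-D bit-field on a torus whose shifts only move an origin pointer."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]
        self.h, self.w = len(rows), len(rows[0])
        self.ox = self.oy = 0              # origin offsets: the "pointers"

    def shift(self, dx, dy):
        """Shift all data by (dx, dy) without copying any cell."""
        self.ox = (self.ox - dx) % self.w
        self.oy = (self.oy - dy) % self.h

    def get(self, x, y):
        return self.rows[(y + self.oy) % self.h][(x + self.ox) % self.w]

    def set(self, x, y, bit):
        self.rows[(y + self.oy) % self.h][(x + self.ox) % self.w] = bit

def update(fields, table):
    """Table-lookup update: the bits seen at a site form an index, and
    the looked-up value is written back to the same site."""
    for y in range(fields[0].h):
        for x in range(fields[0].w):
            idx = sum(f.get(x, y) << i for i, f in enumerate(fields))
            new = table[idx]
            for i, f in enumerate(fields):
                f.set(x, y, (new >> i) & 1)

a = BitField([[1, 0], [0, 0]])
b = BitField([[0, 0], [0, 1]])
b.shift(1, 0)                  # b's particle moves from (1, 1) to (0, 1)
update([a, b], [0, 2, 1, 3])   # example table: swap the two bits at a site
```

Writing results back in place is safe here because, once the fields have been shifted, each site's new value depends only on the bits at that site.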

We sometimes refer to this property of CAM-8
simulations, that they run well even in a pure
software context, as zero-module scalability.
Efficient simulability on a variety of par-
allel and serial architectures should encourage
the use of the CAM-8 machine model as a stan- 
dard for CA work — which would make other 
CAM-8 software efforts much more widely use- 
ful. Applications developed on faithful soft- 
ware simulators (and on small CAM-8 instal- 
lations) will be directly transferable to large 
CAM-8 machines when faster or more massive 
simulations are needed. 

4.3 Programming environment 

For specific applications, it will be the sim- 
ulation context that defines the "high level" 
programming environment. For a logic simula- 
tion, the high level environment might include 
hardware description languages, logic synthe- 
sizers, chip-model libraries, etc. For a fluid 
simulation, the high level environment might 
allow one to "design" a wind-tunnel, obstacles, 
probes, etc. In general, one needs facilities for 
conveniently producing interesting initial con- 
ditions, for visualizing the state of the system, 
for monitoring and analyzing the progress of 
the simulation, etc. Our task here is to pro- 
vide examples, utilities, and "hooks" to facil- 
itate the construction and integration of such 
environments. 

For developing models, one great simplifica- 
tion has been the sharing of code that is possi- 
ble between models that employ a similar spa- 



tial format. For example, we have constructed 
a set of libraries that specialize the CAM-8 ma- 
chine to run CAM-6 style neighborhoods on 
variable-sized two-dimensional spaces. This 
allows generic mechanisms for display, anal- 
ysis, etc. to be shared, allowing the pro- 
grammer to concentrate on developing models. 
These libraries serve both to allow the exper- 
iments and experience of CAM-6 to be applied 
rapidly to this new domain and to allow users 
to develop applications in a simplified and well 
documented context. The library routines also 
serve as examples of how to directly program 
CAM-8 itself. 

The task of providing high-level tools for 
model development has barely begun. Some 
of the work involves only software engineer- 
ing: for example, writing good compilers that 
can automatically partition a rule on sites with 
many bits into a composition of 16-bit opera- 
tions would be a valuable aid (cf. [|llj). Com- 
pilers that can perform specified transforma- 
tions on a space by constructing spacetime cir- 
cuitry would be similarly valuable. Access to 
arithmetic array operations directly on CAM- 
8 would be useful not only for model build- 
ing, but for model analysis. High level de- 
bugging tools that let one quickly compare a 
model's behavior against expectations are es- 
sential. Where adequate models exist, work 
needs to be done on parameterizing known 
modeling techniques and ways of combining 
models. 

4.4 Theoretical challenges 

Ideally, one would like to be able to specify a 
very high level description of a physical sys- 
tem, and have software use some set of corre- 
spondence rules to generate an efficient, fine- 
grained CA model of that system. In gen- 
eral, we don't know how to do this. Present 
modeling techniques are rather ad hoc, and 
the best progress has been made by "dressing 
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up" lattice gases by adding additional particle 
species and interactions, resulting in complex 
models with large numbers of bits at each site. 
Such models are ill suited to an ultimate goal 
of harnessing fine-grained, high-density micro- 
physical systems for CA computations. Fur- 
thermore, there are at present no fine-grained 
CA models of many basic physical phenomena, 
such as the motion of an elastic solid, long-range
forces, or relativistic effects. 

We know that more general methods of con- 
structing models are possible. For example, 
the numerical integration of a finite difference 
equation is actually a type of CA computation, 
and it can reproduce a differential equation. 
This correspondence, however, yields a rather 
restricted class of CA rules, constrained to use 
only arithmetic operations and large numbers 
of bits at each site. Without these constraints, 
other general methods may be possible which 
yield much simpler CA rules that also repro- 
duce a desired macroscopic dynamics — rules 
better suited to high-density microphysical im- 
plementation. Finding such general methods 
is an open problem. 
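
As a small illustration of the finite-difference correspondence, the sketch below applies the standard explicit scheme for the one-dimensional diffusion equation on a ring. The equation, the function name `diffuse`, the parameter `r` (standing for D*dt/dx**2), and the use of exact rationals are choices made for this example only. Each new value depends on a site and its two nearest neighbours, so the update is local and uniform in the CA sense, but each site carries a many-bit number and the rule is purely arithmetic.

```python
from fractions import Fraction

def diffuse(u, r=Fraction(1, 4)):
    """One explicit step of u_t = D u_xx on a ring, with r = D*dt/dx**2.

    The new value at site i depends only on sites i-1, i, and i+1."""
    n = len(u)
    return [u[i] + r * (u[i - 1] - 2 * u[i] + u[(i + 1) % n])
            for i in range(n)]

# Start from a single unit of "heat" at one site and take ten steps.
u = [Fraction(0)] * 16
u[8] = Fraction(1)
for _ in range(10):
    u = diffuse(u)
```

The number of bits needed to hold each exact value grows from step to step, which is one concrete form of the constraint described above.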

Many basic questions remain in the develop- 
ment and analysis of CA models, and progress 
on their resolution will both facilitate, and be 
facilitated by, the use of CA machines. 

5 Conclusions 

By exploiting the uniformity of a virtual pro- 
cessor simulation of fully parallel CA hard- 
ware, we were able to make workstation- 
class hardware outperform supercomputers for 
many CA simulation tasks. Using the same 
technology, a new generation of large-scale CA
machines becomes possible that will make en- 
tirely new classes of spatially organized com- 
putations practicable. Our aim in all of this 
has been to promote the development of CA 
models that can begin to harness the astro- 



nomical computing power that is available, in 
a CA format, in microphysics. 

As stated, this goal is directed toward bring- 
ing computational models closer to physics in 
order to improve computation, not physics. 
But computational models that match well 
with microphysics also tell us something about 
the structure of information dynamics in 
physics. Since a finite physical system has a 
finite entropy, not only computer science but 
also physics itself must deal with the dynam- 
ics of finite-information systems at increas- 
ingly microscopic scales [^. Thus it seems 
possible that promoting the development of 
physics-like computational models will one day 
contribute to the conceptual development of 
physics itself. 
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